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Abstract. Inspired by concepts from ergodic theory, we give new in- 
sight into coding sequence (CDS) density estimation for the human 
genome. Our approach is based on the introduction and study of topo- 
logical pressure: a numerical quantity assigned to any finite sequence 
based on an appropriate notion of 'weighted information content'. For 
human DNA sequences, each codon is assigned a suitable weight, and 
using a window size of approximately 60,000bp, we obtain a very strong 
positive correlation between CDS density and topological pressure. The 
weights are selected by an optimization procedure, and can be inter- 
preted as quantitative data on the relative importance of different codons 
for the density estimation of coding sequences. This gives new insight 
into codon usage bias which is an important subject where long stand- 
ing questions remain open. Inspired again by ergodic theory, we use the 
weightings on the codons to define a probability measure on finite se- 
quences. We demonstrate that this measure is effective in distinguishing 
between coding and non-coding human DNA sequences of lengths ap- 
proximately 5,000bp. We emphasize that topological pressure is a flexi- 
ble tool and we expect it to be useful for the investigation of many other 
features of DNA sequences such as interspecies comparison of codon us- 
age bias. We give a first result in this direction, investigating CDS 
density in the mouse genome and comparing our results with those for 
the human genome. 



1. Introduction 

The theory of symbolic dynamical systems is a rich and only partially 
explored source of techniques for genomic analysis. We introduce a new tool 
called topological pressure (or simply pressure), which we apply to the study 
of the human genome. Pressure can be interpreted as a weighted measure of 
complexity and is the natural generalization of topological entropy for finite 
sequences introduced by the first named author in [24] . The primary goals 
of our analysis are to demonstrate how pressure can predict the distribution 
of coding sequences across a genome and to use this to recover quantitative 
data on codon usage bias. This could shed light on the issue of mammalian 
codon bias, where it is recognized that a complete understanding has not 
yet been achieved [2DJ [3S] . 
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Topological pressure is a well known and well studied concept in the 
ergodic theory of dynamical systems. The standard version is a quantity 
associated to a topological dynamical system which measures weighted orbit 
complexity of the system |31 4 132 } 142]. We introduce a finite implementation 
of topological pressure which can be interpreted as a measure of weighted 
information content of a finite sequence. The pressure of a finite sequence 
is given by counting (with weights) all subwords of an exponentially shorter 
length that appear in the original word. Each subword is weighted through 
the use of a function cp, which we call the potential. We focus on potentials 
which depend on only 3 symbols, so cp is essentially a choice of weighting for 
each codon. Pressure detects a trade-off between complexity in the sequence 
and frequency of occurrence of 'favored' codons. This intuition is made 
rigorous by the Variational Principle from ergodic theory [22], which we 
recall in Q5.21 If a potential can be found so that the pressure correlates 
strongly with an observed biological phenomena, this gives evidence for the 
biological importance of those codons which are weighted strongly by that 
potential. 

Our main focus is the selection of potentials for which the topological pres- 
sure correlates strongly with the observed distribution of coding sequences 
when using windows of size approximately 60, OOObp. We optimize this cor- 
relation independently on each chromosome with respect to the parameters 
of the potential and find that the Pearson's correlation coefficient between 
the coding sequence density and the topological pressure is above 0.9. The 
parameters obtained on each chromosome are close together (at least for 
the autosomes) so we average them to obtain a 'canonical' potential for 
CDS density estimation which we denote by <p^ B . We check that a similar 
potential is obtained when we optimize the correlation across the whole hu- 
man genome simultaneously. We give a detailed analysis of the pattern of 
codon weights given by the potential ip^ and observe a number of striking 
qualitative features that contribute to the investigation of codon usage bias. 

Recent research [33l [TJ EJ EJ [121 [30l [38] has focused on analyzing the nu- 
anced and oft-debated question of the nature and cause of codon usage bias. 
While many studies have successfully analyzed a particular influence on 
synonymous codon usage (context dependency [13], GC content [8], tRNA 
adaptation [12], etc.), a comprehensive understanding of codon bias (par- 
ticularly in mammals) remains a challenging open problem. The difficulty 
in analyzing codon bias can be attributed partly to an over-abundance of 
plausible statistical and theoretical approaches which are often mutually 
contradictory [33J. Furthermore, there is general disagreement on how to 
properly take into consideration features of the sequence such as GC content 
and context dependencies, as well as the inherent randomness of nucleotide 
composition. For example, it has been argued that the codon adaptation 
index [33] should [36J and should not [19] be used to determine the influence 
of synonymous codon usage on gene expression levels. 
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The advantage of utilizing topological pressure is its relative simplicity. 
The definition is entirely combinatorial and implicitly takes account of im- 
portant considerations such as neighboring dependencies, different choices 
of reading frame, autocorrelation, background codon frequencies, and GC 
content. Furthermore, for sequences of suitably large scale, the definition is 
robust enough to absorb the inherent randomness and noise in nucleotide 
composition. 

These advantages allow us to compare synonymous codon usage between 
species and its relationship with CDS density estimation. Using the poten- 
tial (fhs, we show that pressure has good correlation with the CDS distri- 
bution of mus musculus. In addition, we optimize the correlation of CDS 
density with pressure over the mouse genome. We observe that the poten- 
tial obtained this way shares many qualitative features with This gives 
evidence that the parameters in ip^ s are biologically meaningful and is a first 
step in the investigation of interspecies codon usage via topological pressure. 

Inspired once more by the techniques of ergodic theory, we demonstrate 
that any potential ip canonically defines a probability measure on finite se- 
quences via the Variational Principle. This measure, called the equilibrium 
measure for cp, reflects the properties of the potential and can be used to 
analyze sequences that are orders of magnitude shorter than those on which 
pressure is utilized. This represents a strategy in which large scale infor- 
mation (pressure) can be utilized to extract information at a much smaller 
scale (measure of a sequence). The development of robust techniques that 
detect the coding potential of short sequences is an important area of re- 
search [TQl H31 [16l [18l [261 ETl EH S3] with applications to sequence annotation 
as well as gene prediction. It has been recognized that measures of coding 
potential based on single sequence nucleotide composition [271 P-1281] are 
an important part of the problem of differentiating between short reads of 
coding and non-coding sequences and are complementary to the very effec- 
tive comparative techniques developed in, for example, [43]. We contribute 
to this line of research by showing that the equilibrium measure associated 
with t^hs can effectively distinguish between randomly selected introns and 
exons in the human genome. 

The layout of the paper is as follows: In £[2j we define topological pres- 
sure for finite sequences. In we investigate the correlation of topological 
pressure and CDS density in the human genome. In §4[ we briefly investi- 
gate applications of topological pressure to the mouse genome. In we 
demonstrate how topological pressure defines a measure on finite sequences, 
and show that this measure can distinguish between coding sequences and 
non-coding sequences. 

2. Topological pressure for finite sequences 

We rigorously develop our implementation of topological pressure for any 
finite sequence. Let A be our alphabet, that is, a finite collection of symbols. 
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Since our application is to the study of DNA sequences, we mainly consider 
the alphabet A = {A, C, T, G}. We consider various spaces of sequences on 
the alphabet A. We denote the space of sequences of length n by A n , the 
space of finite sequences (of any length) „4 <N , the space of finite sequences 
of length at least n by A- n and the space of infinite sequences by £ = „4 N . 
For w = (wi,W2, • • •) G £ or w = (wi,w 2 , • • • , w m ) G A- n , let iiu™ denote the 
finite word {w\, . . . ,w n ). For w G A n , let [w] be the set of sequences v G £ 
so that Vi = w. Let a be the shift map: For w = (wi,w 2 ,w 3 , ■ ■ ■ ) G -4- 2 U£, 
a((w 1 ,w 2 ,w 3 , ...)) = (,w 2 ,w 3 , ...). 

When we consider norms of matrices M = (rriij) and vectors v = (vi), we 
consistently use the sum norm, so that ||M|| = Ylij \ m ij\ an d IMI = Yli \ v i\- 

Definition 2.1. We say a function ip on £ (or on A- m ) depends on the 
first m symbols of a word if 

(1) For all v G A m , the restriction of ip to [v] is a constant function. 

(2) There exists w G A" 1 ^ 1 for which the restriction of ip to [w] is not a 
constant function. 

If ip depends on m symbols, then for v G A m , we write ip(v) for the common 
value of ip on [v] . 

We define topological pressure for finite sequences w G A-^. Define 

SW n (w) = {u : \u\ = n and u C w}. 

The definition depends on the cardinality of the alphabet. To keep the 
presentation close to our applications, we give the definitions under the 
assumption that jfA = 4. For alphabets of different cardinality, we simply 
replace the occurrences of 4 with jpA. 

Definition 2.2. For a word w such that \w\ = 4™ + n — 1 and a poten- 
tial function ip which depends on m symbols, where n > m, we define the 
topological pressure of ip on w to be 

(2.1) P(w, ip) = ^\og 4 p(w,ip), 
where 

n—m 

(2.2) p(w,iP)= Yl exp £ VO^u)- 

ueSW n (w) i=0 

We denote the greatest topological pressure for such words by 

(2.3) Pmax(n, ip) = max{P(w;, ip) : \w\ = 4 n + n - 1}. 

Remark. When ip = logt/9 (log denotes natural logarithm) for a function 

<P > o, 



^ / n—m \ 

(2.4) P(w, log <p) = - log 4 II ^) I • 

\ueSWn(w) i=0 
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We extend this definition to words of an arbitrary finite length. 

Definition 2.3. For a word w with 4 n + n — 1 < \w\ < 4 n+1 + n, we define 
the topological pressure of ' tp on w to be 

(2.5) P(w,4>) =P(wf +n - 1 ^). 

That is, the pressure of ip on w is defined to be the pressure of ip on the first 
4" + n — 1 symbols of w. 

Remark. An elementary argument given in [23] shows that for each n, there 
exists a word v n of length 4 n + n — 1 which has every word of length n as a 
subword. It follows that -P m ax(V') n ) = P(v n ,ip) for any function ip. 

Remark. When ip = 0, (12. ip reduces to the definition of topological entropy 
for finite sequences due to the first named author in |24j . The reason we 
take the logarithm in base 4 in (12. ip is so that -P m ax( n >0) = 1. 

2.1. Normalization of potentials. An arbitrary potential ip = log tp can 

be normalized by the addition of a constant. This is useful for a number 
of reasons, and does not affect the quantities associated to pressure that we 
study in this paper, such as the equilibrium measures introduced in $5] and 
correlation with the CDS density developed in £j3j For any t > 0, we have 
the formula 

(2.6) P(w, log tip) = log 4 1 + P(w, log if). 

n 

This allows for a variety of normalizations. For us, the most useful normal- 
ization is to let t = Hvll -1 so that tip is described by a probability vector. 
We use this normalization frequently in §2.31 

2.2. Interpretation of high pressure sequences. A sequence with high 
pressure has a good mix of complexity and frequency of 'favored' codons. 
When using the potential (i.e. entropy), we simply detect high complexity. 
In [23] , it was shown that an intron region of a DNA sequence tends to have 
higher entropy than an exon region. This is due to the exons having more 
structure (and hence less randomness). However, for windows of larger size, 
which may contain numerous intron and exon regions, entropy is a poor 
indicator of CDS density (see figure [1]). In §3.51 we demonstrate that with 
an appropriate choice of potential, the high pressure sequences correlate 
very well with those with high coding sequence density. Further insight into 
the meaning of pressure is given by the Variational Principle from ergodic 
theory, which we recall in §5.21 The Variational Principle makes precise the 
intuition that high pressure sequences are those that balance high complexity 
against high frequency of favored codons. 

2.3. Selection of the Potential. Two perspectives can be taken regarding 
selection of the potential <p. The first perspective is to obtain a potential 
via maximizing the correlation of topological pressure with a given set of 
biological data. We take this approach in section $3] to select potentials 
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based on the correlation of pressure with the probability distribution of 
known coding sequences. 

The second perspective is to construct a potential based on known biolog- 
ical phenomena and then utilize topological pressure to analyze the desired 
feature. Next, we give an example of such a potential. 

2.4. A 1-parameter family of examples. We give a simple family of ex- 
amples to illustrate the role of pressure. We write down a potential adapted 
to detecting regions with high GC content. Since we focus on a much broader 
and more sophisticated class of potentials in the rest of this paper, this ex- 
ample should be understood as an illustrative toy model. Let 1a denote the 
characteristic function of [A], and suppose that \w\ = 4 n + n — 1. Consider 
the family of functions 

(ft = lA + lr + t(lG + lc), 

where t > 0. Then 

n-l 

p(w,\og<pt) = J]^(a i n)= tGC(U) > 

uesw„(w) i=o uesWn(w) 

where GC(u) denotes the total number of occurrences of G and C in the 
word u. Then 

P(u/ 3 log</?i) = H top (w) 

and as t increases from 1, P(w, log <pt) gives a measure of complexity which 
assigns increasing importance to sequences with greater GC content. In 
§3.101 we investigate this family of potentials and how best to choose t in 
the context of CDS density estimation in the the human genome. 

3. Topological Pressure and CDS density estimation 

We show that pressure can be used as an effective predictor of coding 
sequence density for the human genome. The challenge is to make a good 
choice of potential function. The discovery of a potential function which 
correlates well with coding sequence density then yields biologically relevant 
information on the roles of different codons. 

3.1. Coding Sequence Density of the Human Genome. The coding 
sequence density is the probability density function representing the per- 
centage of coding sequences versus non-coding sequences in non-overlapping 
windows of a given size. We introduce some notation in order to define the 
coding sequence density precisely. 

Notation 3.1. Let Chr(i) denote the string which represents the i th chro- 
mosome of the human genome, and Chr(i, [n, m]) denote the substring which 
starts at position n and ends at position m. For convenience, we refer to 
the X and Y chromosomes as the 23 rd and 24 th chromosomes respectively. 
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We utilize the NCBI hgl8 build 36.3 with coding sequences defined by 
NCBI RefSeq genes and accessed via Wolfram's Mathematica 8.0 [44] • We 
choose a chromosome and fix an integer window size m to divide the chro- 
mosome into non-overlapping windows of length m. The most suitable win- 
dow sizes for comparision with topological pressure are those of the form 
m = 4 n + n — 1. 

Definition 3.1. For some fixed window size m = 4™ + n — 1, we define 

#CS(i,n,x) := #{Known coding sequences with initial nucleotide 
contained in Chr(i, [xm + 1, xm + to])}, 

assuming the chromosome is read in the p to q direction. The coding se- 
quence density is defined to be 

CDS(i,ra,x) := #CS(i,n,x)/#CS(i), 

where #CS(i) := #{Known coding sequences in Chr(i)}. 

Thus, the indices i and n tell us to look at the i th chromosome using a 
window of size 4 n + n — 1, and the index x describes the starting point of 
the window along the given chromosome. For fixed i and n, CDS(i,n, x) is 
a probability density function of x. 

3.2. Topological Pressure of the Human Genome. We now set up 

notation for our application of topological pressure to the human genome. 

Definition 3.2. For a potential ip > and m = 4 n + n — 1, let 

P hs (i, n, x, ip) := P(Chr(i, [mx + 1, mx + m]), log (p), 

where P(-,-) is the topological pressure defined in equation (12.5P . 

Thus, P hs (i , n, x, tp) is the topological pressure associated to the x th win- 
dow of size 4™ + n — 1 on the human chromosome i, using the potential 
log ip. 

3.3. Selection of ip via maximum correlation with CDS density. For 

fixed i and n, we consider CDS(i, n, x) and P hs (i, n, x, ip) as functions in x. 
We use the Nelder-Mead [29J method to maximize the correlation between 
P hs (i, n, x, tp) and CDS(?,n,x) with respect to potentials tp which depend 
on 3 symbols. Due to (|2.6h . we can without loss of generality restrict our 
attention to the set of potentials whose parameters sum to 1. We thus obtain 
a set of potentials (/'max whose associated pressure correlates very well with 
the coding sequence distribution in the chromosome Chr(i). We expand on 
our methodology below, and then present and analyze our results. 
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3.4. Methodology. Considered as functions in x, both CDS(i, n, x) and 
P hs (i, n,x,tp) are inherently noisy due to random fluctuations in nucleotide 
composition in a given chromosome as well as due to incomplete knowledge 
regarding coding sequences (eg. incorrectly annotated sequences). The noise 
in both functions is easily suppressed by utilizing a Gaussian filter (convo- 
lution with a Gaussian kernel of radius r). We checked that other standard 
smoothing techniques lead to similar results, and chose the Gaussian filter 
for our analysis due to its simplicity and speed of implementation. The fil- 
ter is applied after removing from both CDS(i, n, x) and P hs (i, n, x) those x 
where Chr(i, [xm + 1, xm + to]) contained any symbols besides {^4, C, T, G}. 
The radius of the Gaussian filter is chosen so that CDS(i,n,x) coincides 
at each x with the probability density function obtained from a Gaussian 
kernel density estimation of CDS(i,n, x) considered as a function of x: that 
is, we linearly interpolate the quantity 

1 ^ fx — CDS(i, n, Xj) 



(3.1) Y, k 



h*m ' \ h 

j = l 

where to = L ^jq^z^ J , k(u) = "y|^ e ~" 2 ^ 2 > an d the bandwidth h is selected 
according to Silverman's rule [39] . 

The selection of the window size in P hs (i,n,x,ip) exhibits the typical 
trade-off between sensitivity and specificity: a smaller window size allows 
for a finer approximation of the CDS distribution, but exhibits a higher 
sensitivity to fluctuations in nucleotide composition. We focus on a window 
size of 65,543 (n = 8), as this seems to achieve a good balance. This 
corresponds to dividing Chr(l) into roughly 3700 non-overlapping windows. 

After fixing i and n, we utilize the Nelder-Mead [29J method to maximize 
the correlation between P hs (i, n, x, (p) and CDS(i,n, x) with respect to po- 
tentials (p which depend on 3 symbols and whose parameters sum to one. 
The precision threshold for the convergence of this heuristic maximization 
technique was set to 10 -6 and convergence was typically achieved in 4000 
steps of the algorithm. We denote the potential thus obtained on the i th 
chromosome by v?max- 



3.5. Results. For each chromosome, we obtain a potential <^ max for which 
CDS(i,8,x) and P hs (i, 8, x, y max ) display very strong positive correlation. 
The value of the Pearson correlation coefficient on each chromosome is shown 
in figure [H and is above 0.9 in all cases. Figure [1] also demonstrates that 
topological entropy is not a good estimator of coding sequence density. This 
is unsurprising since we have no theoretical reason to expect correlation 
between entropy and coding sequence density since multiple intron and exon 
regions may be contained in windows of this size. The parameter values for 
each V'max can be found at 



http : //www. math. psu. edu/koslicki/potentials .xls 
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We also provide in figure [2] a plot of the standardized values of both 
CDS(5,8,x) and P hs (5, 8, x, 9? max ) to show the goodness of fit, and overlay 
these plots on the Ensemble Genome Browser [22] histogram of known genes. 

3.6. Comparison of the potentials ¥? m ax- Let r* = (r\, r\, . . . , rg 4 ) repre- 
sent the 64 parameters of the potential v? max . We show that the parameters 
for </? max exhibit a consistent codon bias by demonstrating that the proba- 
bility vectors r l are relatively close in the standard Euclidean metric. Figure 
[3] is a plot of the pairwise Euclidean distances between each of the chromo- 
somes. We have 

max d(r\ r 3 ') = .319 and mean d(r\ r j ) = .203 
i,je{i,...,24] i,j 6 {i,...,24} 

The sex chromosomes X and Y are clear outliers, so focusing on the auto- 
somes, these values improve to 

max d(r',r J ) = .284 and mean d(r\ r j ) = .195 

i,i6{l,-,22} i,ie{l,...,22} 

This is relatively close against a maximum possible distance of \f 7 l. 

In [23], it was observed that the sex chromosomes exhibit a distinctly 
different entropy distribution than the autosomes. This observation coin- 
cides with the fact that the sex chromosome potentials were furthest from 
the autosomal potentials. Interestingly, the potential (/'max corresponding 
to chromosome 7 was similarly distant from the other chromosomes. This 
is consistent with the fact that chromosome 7 contains many regions identi- 
cal to the sex chromosomes (of 30, 000 non-overlapping sequences of length 
5,000 from Chr(7), over 77% matched identically with a sequence in chro- 
mosome Y). 



3.7. The best choice of potential for CDS density estimation. We 

make a 'canonical' choice of potential for estimation of CDS density on 
the human genome by taking a suitable average of the potentials (/'max- 
There are various natural ways to do this, each yielding qualitatively similar 
results. The resulting potential is meaningful because, as shown in §3.61 the 
individual potentials are close to each other. 

Definition 3.3. We define a 'canonical' potential for detecting coding se- 
quence density in the human genome which we denote by ipi is . For each 
codon w, we let 

V?h S H := median{<£> max («;), . . . , ^(ro)}, 



and define 
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Figure 1. Correlation between pressure and CDS density 
Correlation Using Potential <^' max 
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Figure 2. Coding sequence density, pressure, and Ensembl 
known CDS histogram 



Standardized Value 




Other natural ways to obtain the 'canonical' potential would be to take 
the mean of the parameter values of each ^ ax , or to take the median/mean 
after omitting the outlying chromosomes X, Y and 7 from the data set. 
Each of these approaches yields a very similar potential. Alternatively, we 
can perform the maximization procedure on the sequence formed by con- 
catenating all the autosomes. This approach yields a potential which is close 
to 93hs (Euclidean distance less than .148) and qualitatively identical. 

We include a visualization of the parameters of </?hs i n figure [H The 
results of the correlation between the topological pressure P(i, 8, x, ip^) and 
CDS(i, 8,x) for the autosomes are contained in figure [5j 

3.8. Analysis of parameter values for ip^. We give an in-depth analysis 
of our canonical potential. As can be observed from figure U it is clear that 
iphs exhibits a distinct codon bias. We summarize and attempt to explain 
some of the most distinctive features of p^s '■ 

• The codons UCG, CCG, UAC, CGC, CGG, AGG and GGC are the 
most heavily weighted, and the codons CGU, ACU, GCG, UAU, 
UGA have quite strong weightings. 

• Codons which contain the pair CG or GC tend to be highly weighted 
(for example UCG, CCG, GCG, CGC, CGG, GCC). This is ex- 
plained by the well known connection between GC content and CDS 
density. However, we will see in §3.101 that basing a potential on 
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Figure 3. Pairwise Euclidean distance of the parameter val- 
ues for (/'max- The darker the square in position the 
greater the distance between r* and r J . 
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GC content alone is not sufficient for accurate estimation of CDS 
density. 

According to the weights in <^h s > the expected GC content of a se- 
quence is 58.4%, which is moderately high since it was shown in [35] 
that average GC content for a 100-kb segment of the human genome 
is between 35% and 60%. We calculate expected GC content by the 
formula 

^ <^ hs H(N G H+N c H)/3, 

w£A 3 

where Nq(w) denotes the number of times the letter G appears in 
the word w (similarly for Nc(u>)). This supports the commonly held 
notion that high GC content corresponds to high coding sequence 
density in the human genome [U [18] . 

The start codon (AUG) is weighted near zero. This may indicate 
that from a large scale perspective, start codons provide too weak a 
signal to utilize in estimating CDS density. 

The stop codons UGA, UAG, UAA exhibit a decreasing order of sig- 
nificance. This reflects the observation contained in 1411 that UGA 
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Figure 4. Plot of 50 times the parameter values of </?hs- The 
area of a square is equal to the corresponding potential value. 
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is utilized most frequently to terminate transcription in the human 
genome. Furthermore, this pattern of decreasing importance of stop 
codons reflects the alternate decoding of stop codons (see [TT], [4*0] . 
[45] . etc.). In particular, the two stop codons that can be alternately 
transcribed (UGA and UAG into Selenocysteine and Pyrrolysine re- 
spectively) are weighted much more strongly than UAA. 

• Codons made up of a single repeating nucleotide receive consistently 
low weights. This can be explained by the presence of long repetitive 
regions in non-coding regions. 

• We analyzed a number of physical properties associated to amino 
acids and codons (e.g. acidity, polarity, hydropathy, etc.), and found 
a weak (.293) but statistically significant (p < .025) correlation be- 
tween the values of <^hs and heat of combustion of the corresponding 
codons. We are not aware of any results in the literature which 
would give a theoretical basis for this observation. 
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Figure 5. Plot of Pearson correlation coefficient between 
the coding sequence density of the human genome and the 
topological pressure associated to the potential <£>hs- 
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Synonymous codons may receive very different weightings. For ex- 
ample, among the codons which specify Glycine, GGC is strongly 
weighted but GGU, GGA and GGG are all weighted near zero. 
Of the five amino acids with four-fold degenerate sites, distinct codon 
bias is observed: each amino acid with a four- fold degenerate site 
shows a clear bias towards a nucleotide in the third site (with the 
exception of Proline where two particular codons are favored) . This 
corresponds with the observations of previous comparative studies, 
for example [M] , where it was observed that there exists selectively 
driven codon usage at four-fold degenerate sites for mammals (with 
a weak bias towards C). Our study suggests that any of the four 
nucleotides may be favored in the third position (A for Val, G for 
Ala, C for Gly, U for Thr). 

Amino acids with twofold degenerate sites seem typically to carry 
similar weightings. For example, GAA and GAG both have neg- 
ligible weightings, while UAU and UAC are both weighted quite 
strongly. The mean variance of the weighting at twofold degenerate 
sites was 4.7 x 1CP 5 while the mean variance over all amino acids 
was 1.6 x 1(T 4 . 

For most amino acids, either exactly one codon is weighted strongly 
(Leu, Val, Ser, Thr, Ala, His, Gly) (or at least more strongly than 
the others (His, Gin, Asn)), or no codons are weighted strongly 
(Phe, He, Met, Lys, Asp, Glu, Cys, Trp). A notable exception is 
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Arginine where three out of its six synonymous codons are weighted 
strongly. This may suggest that Arginine has a particularly impor- 
tant role. This could reflect the evolutionary pressure exerted on 
Arginine as observed in [23], where it is noted that Arginine has a 
much lower frequency of appearance than expected. Recently, in |25| 
it was shown that in yeast, preferential synonymous codon usage for 
Arginine greatly affects expression levels via influencing translational 
efficiency Our results may indicate that a similar phenomenon oc- 
curs in the human genome, as this would be another explanation for 
the strong weighting of Arginine. 

3.9. Selecting potentials using intron/exon density. Many single se- 
quence techniques for measuring the coding potential of DNA sequences are 
based upon frequencies of codons or n-mers in known intronic and exonic 
regions [U [91 [101 [21] . We can use this principle to write down potentials 
Vintron an0 ^ V'exon which are based simply on the frequency of codons in the 
intron (or exon) sequences. 

More precisely, we let Introns(i) denote the collection of all segments 
of chromosome i which correspond to known intron regions. For each w 6 
Introns(i), we let N v (w) denote the number of times a given codon v appears 
in w, and we note that the total number of codons (with overlap) in w is 
\w\ — 2. We define a potential ¥?} ntron by assigning each codon a weight by 
the formula 

/ \ N v (w) 
(3-2) </4tro»:= E tJH" 

ui£lntrons(i) 

We define potentials f l exon analogously, using the frequencies of codons that 
appear in known exon regions of chromosome i. Finally, as in definition 13.3^ 
let ^exon(^) : = m edian{(^ xon (u;), . . . , v?^ on (w)} and define tp exOQ := Zo xon |i - 

II rcxon II 

As one would expect, the pressure taken with respect to the potentials 
^intron ( res P- vlxon) tends to have significant negative (resp. positive) corre- 
lation with the coding sequence density: see figure The mean correlation 
between P(i, 8, x, (p\ ntmB ) and CDS(«, 8,x) was —.531. The mean correla- 
tion between P(i, 8, x, Vexon) an d CDS(i,8,x) was .376. While this clearly 
shows a correlation, it is significantly weaker than that obtained using the 
potentials y?' max . We conclude that potentials which are based simply on 
frequencies of occurrence of codons in intron/exon regions are useful to an 
extent, but that more sophisticated potentials, such as (^hs ; yield much bet- 
ter results. See figure [6] for a comparison of the potentials (/?hs and v^exon- 

3.10. Selecting potentials to detect GC content. We investigate the 
pressure of the family of potentials introduced in §2.41 

ip t = l A + l T + t(l G + l c ), 

where t > 0. It is a commonly held notion that high GC content corresponds 
to high coding sequence density in the human genome (see §3.8h . We give 
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Figure 6. Visualization of parameter values of potentials. 
The area of a square is equal to the corresponding potential 
value. 
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evidence that the link between GC content and CDS density is significant 
but weak. 

For chromosome 1, we find that as t varies, the largest correlation be- 
tween P hs (l,8, x,tpt) and CDS(l,8,x) is .138. This maximum is attained 
(uniquely) when t = 10.308. Over all the chromosomes, the maximum 
correlation of P hs (i, 8, x, ipt) and CDS(i,8, x) has a statistically significant 
(p < .0005) mean of 0.121 with a variance of .00359. This maximum is 
achieved for a mean parameter value of t = 10.306 with a variance of 15.780. 
The outliers were chromosome 18, which achieves maximal correlation at 
t = 21.246, and chromosome 15, which achieves maximal correlation at 
t = 0. Excluding these two chromosomes gives essentially the same mean 
(t = 10.273), but a much improved variance of 4.98. These results indicate 
that potentials based on GC content give a weak positive correlation with 
CDS density. However, the much higher correlation obtained when using 
the potential (p^ indicates that considering GC content alone is far from 
optimal in CDS density estimation. 

4. Application of the Potential tp^ to the Mouse Genome 

We further illustrate the biological significance of the potential cp^ by ex- 
amining the correlation between the coding sequence density of the mouse 
genome and the topological pressure associated to the potential ip^s- Fol- 
lowing the setup of section I3T41 we retrieve the mouse genome (build mm9, 
NCBI build 37) from the UCSC database (15] via Galaxy |17| . extract 
the RefSeq genes, and then define CDS mm (i, n, x) and P mm (i,n,x) for the 
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Figure 7. Correlation between pressure and coding se- 
quence distribution 
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mouse autosomes. The results of the correlation between the topolog- 
ical pressure P mm (i, 8, x, </?h s ) associated to the potential obtained from 
the human genome and the coding sequence density of the mouse genome 
CDS mm (i, 8, x) are contained in figure We see a strong positive corre- 
lation. This indicates that codon usage in the mouse is similar to codon 
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usage in humans, and gives further evidence that the potential 92hs gen- 
uinely encodes biological information relevant to detecting coding sequence 
distributions, even across different species. 



Figure 8. Plot of Pearson correlation coefficient between 
the coding sequence density of the Mus Musculus genome 
and the topological pressure associated to the potential ifh s . 
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We follow the maximization procedure outlined in §3.41 to obtain poten- 
tials V'mmmax that maximize the correlation between P mm (i, 8, x, (p) and 
CDS mm (i, 8, x) with respect to 92. Following section §3.71 we average over 
the potentials (f l mm max to obtain a 'canonical' potential for the mouse, which 
we denote by </9 mm . In figure El we include a visualization of the parameter 
values for <£ mm and c^hs to demonstrate the similarities between them. It 
will be an interesting project to carry out this procedure for a much larger 
collection of species and investigate the similarities and differences between 
the potentials that are selected for each species. 

5. Equilibrium measures and DNA 

As mentioned in the introduction, an important area of research is to 
develop single sequence measures that effectively distinguish between short 
coding sequences and short non-coding sequences. Here, we utilize ergodic 
theory to develop such a measure. 

Given a locally constant function if) on „4 N , the theory of thermodynamic 
formalism gives us a means of selecting a Markov measure fj,^, known as 
the equilibrium measure for if). We adapt this theory to the case of finite 
sequences. The measure thus obtained reflects the properties of the function 
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Figure 9 . Plot of parameter values of median potentials for 
human and mouse respectively. The area of a square is equal 
to the corresponding potential value. 
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ijj. We carry out this procedure for our canonical potential (p^ an d obtain a 
measure that is effective for the analysis of relatively short segments of DNA 
sequences. We give a brief review of the theory of equilibrium measures in 
the case of potentials which depend on 3 symbols, and show that the measure 
thus obtained is meaningful in the finite setting also. We then give numerical 
results to demonstrate that our measure can distinguish between coding and 
non-coding sequences with a reasonably high probability of success. 

5.1. Constructing Markov measures from potentials. We are primar- 
ily concerned with functions that depend on 3 symbols, using the alphabet 
A = {A, C, T, G}. That is, we consider potentials ip = log ip, where 

(5.1) <p = ^2 t w l w , 

w&A 3 

so that each t w > is a parameter associated to the word w £ A 3 . We 
review how this this function defines a Markov measure with memory 2 on 
S. The presentation here is a special case of more general expositions given 
in [21 El ESI ED 132 02]. Let B = {A,C,G,T} 2 . Enumerate B by some 
natural ordering. For example, let 

ioi = AA, w 2 = AC, u> 3 = AG, u> 4 = AT, w 5 = CA, ...,w ie = TT 

Define a 1 — square matrix S of dimension 16 as follows. Let S{j = 1 if and 
only if the second letter in W{ is the same as the first letter in Wj. Otherwise, 
set Sn = 0. 
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We now use the potential tj) to define a non-negative matrix M of dimen- 
sion 16 as follows. Let gij = logi^, where if Wi = IJ, and Wj = JK, then 
w = UK. Let g(i,j) = if the second letter in Wi is not the same as the 
first letter in Wj. We define M by 

(5.2) My = S ije 9 ^\ 

The Perron-Frobenius theorem gives a maximal eigenvalue A > and a 
strictly positive vector r such that 

Mr = Xr. 

Now define a matrix P of dimension 16 by 

„ Man 

(5.3) P„ = ^ 

It is easy to check that is a stochastic matrix and that there is a unique 
probability vector p so that j>P = p. More explicitly, p% is given by nor- 
malizing the vector 1^, where I is a strictly positive left eigenvector for M. 
For a, b, c £ A, let p(ab) = pi when a6 = itfj, and let P(ab,bc) = Pij when 
a6 = t^j and 6c = u>j. We use the pair (p, P) to define a measure as follows. 

Definition 5.1. We define a probability measure /iw, on A^ , or A n for any 

fixed n > 3, by the formula 

fJ^([xi ■ . . X k \) = p(x 1 X 2 )P(x 1 X2,X2X 3 )P(x 2 X3, £3X4) . . . P(x k _ 2 Xk-l , X k - X X k ) . 

We call the measure fi^ the equilibrium measure for ij;. 

5.2. Properties of the equilibrium measure. We recall the classical 
theory from dynamical systems which explains the importance of First, 
we recall the definition of topological pressure for the full shift. 

Definition 5.2. The topological pressure of ip on the full shift E over an 
alphabet A is defined to be: 

P(E,V) = lim -log I V expVVVu)] . 



\«e.4 n i=o 



The following result gives the fundamental relationship between pressure 
and invariant measures 



Theorem 5.1 (Variational Principle). P(Y,,ifi) = sup m {/i m + f ifidm} , 
where the supremum is taken over all a-invariant probability measures on 
E, and h m denotes the measure theoretic entropy, given by 

h m = lim — m(M)logm(M). 

n— >oo n — J 

The variational principle illustrates the trade off between structure and 
complexity which is detected by pressure. Pressure effectively balances the 
inherent randomness in a sequence while still reflecting the emphasis encoded 
by the potential tp. Pressure simultaneously maximizes entropy (which is 
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maximized by the uniform measure) and the average value of the potential 
(the integral itself is maximized by a Dirac measure) . A measure achieving 
the supremum in the variational principle is called an equilibrium measure 
for ip. The following result, proved in [32], §4], tells us that the measure 
constructed in the previous section is indeed an equilibrium measure. 

Theorem 5.2. The Markov measure \i = fj,^ is the unique equilibrium mea- 
sure for ip and 

P(E,i/}) = h li + J = log A, 

where A is the Perron- Frobenious eigenvalue of the matrix (|5.2p . 

The relationship between ip and (J,^ is captured by the Gibbs property, 
established in HE!]. 

Theorem 5.3 (Gibbs property). For ip = log(£> defined as in (|5.ip and any 

w£A n , 

n-2 

/MM) ~ exp{-nP(S, V) + J>K +2 )}, 

i=i 

where a n X b n means there exists a constant C > 1 so that C _1 < a n /b n < C 
for all n. 

Thus, if we normalise ip so that P(H,ip) = (which is done by taking a 
suitable multiple of if), then 

n-2 

(5.4) ^(H)xn^, i+2 ). 

i=l 



5.3. Relationship between the equilibrium measure and pressure 
for finite sequences. We apply the theory developed in §5.21 to finite se- 
quences. The proof of the following result is similar to that of [42, Theorem 
7.30]. 

Theorem 5.4. Whenip depends on 3 symbols, Pmaxi^^) = log HM™" 2 !! 1 /™, 
where M is the matrix constructed in (15. 2p . Since ||M n ~ 2 || 1 / n converges to 
A exponentially fast as n — > oo, then for large n, Pmax(^> n ) is very close to 
log A. 

Thus, the number log A is still important for finite sequences. The for- 
mula (15. 4p reveals the meaning of the measure ji^. Sequences which have a 
relatively high frequency of words w G -4 3 where t w is large, and a relatively 
small frequency of words w G A 3 where t w is small, will have relatively 
large measure. This gives a theoretical underpinning for using fj,^ to predict 
coding sequence density. 
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5.4. An equilibrium measure for CDS density estimation. We show 
that the equilibrium measure has practical applications to distinguishing 
between coding and non-coding DNA sequences. Recall that in $3] we found 
a potential <^hs f° r which the pressure of human DNA segments of length 
65, 536bp has strong positive correlation with the coding sequence density. 
We now show that the equilibrium measure associated to log^hs, which 
we denote by /%3, can distinguish between coding and non-coding DNA 
sequences with a reasonable degree of success. 

The advantage of using the measure /Uhs rather than pressure associated 
to (fbs is that the measure is effective in analyzing relatively short DNA 
sequences (10bp-5000bp). Indeed, when ^ depends on 3 symbols, pressure 
is only defined for sequences of length at least 4 4 — 4 = 251, and only 
becomes an effective tool for much longer sequences where the noise inherent 
in the calculation of pressure is effectively suppressed. While the equilibrium 
measure is a cruder tool than the pressure, it is nevertheless effective for 
analyzing shorter sequences where the pressure is unavailable. 

To demonstrate this phenomena, we show that the measure /i^s can par- 
tially distinguish between a randomly selected assortment of intron and exon 
sequences that are more than an order of magnitude shorter than the se- 
quences on which pressure was evaluated. We randomly select 5, 000 intron 
sequences and 5,000 exon sequences from Chr(l), each of length 5,000bp. 
These sequences are completely un-preprocessed: no information such as 
ORF's, stop/start codons or repeat masking is utilized. As expected, the 
measure /ihs reflects the properties of the potential ip^: exon sequences are 
typically weighted more heavily than intron sequences. This is demonstrated 
by figure [TUl which shows the histogram of log(/Xhs) evaluated on the test 
sequences. We also include the ROC curve (true positive rate vs. false 
positive rate) associated to /Xhs- The area under the curve (AUC) is 0.722. 



Figure 10. Histogram of log(/Uhs) evaluated on the test set 
of introns and exons 

1000: 
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Figure 11. ROC curve associated to ^ 
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6. Conclusion 

We have introduced a definition of topological pressure for finite sequences 
inspired by, and related to, the classical definition from ergodic theory. We 
have applied this definition to DNA sequences in four distinct fashions. First, 
we obtained a potential that effectively estimates the distribution of coding 
sequences across the human genome. Second, we gave a detailed analysis of 
this potential to give new evidence about which codons are most important 
in coding sequence density estimation. Thus, pressure can be used as a tool 
for the study of synonymous codon usage. Our analysis effectively measures 
which codons are most important and not simply most frequently appearing. 
Third, we used topological pressure to compare coding sequence density in 
the human and the mouse genome, giving evidence via pressure that codon 
usage is similar across both species. Lastly, we derived the equilibrium 
measure associated to a particular potential and showed that this can be 
used to distinguish between relatively short reads of coding and non-coding 
sequences. 

This study has indicated that topological pressure may help elucidate 
the nuanced problems of mammalian codon bias. Since topological pres- 
sure does not rely on a particular statistical perspective but is motivated 
by a rigorous implementation of a well developed mathematical theory, we 
expect that our approach will yield many further applications in genomic 
analysis in the future. We expect that the inclusion of pressure in compara- 
tive studies will contribute to the understanding of the relationship between 
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codon bias and gene expression levels. Furthermore, the ability of equilib- 
rium measures to succinctly encapsulate information obtained on very large 
scales indicates the usefulness of pressure for the development of measures 
of coding potential for short DNA sequences. 
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